72 research outputs found
Second-order Temporal Pooling for Action Recognition
Deep learning models for video-based action recognition usually generate
features for short clips (consisting of a few frames); such clip-level features
are aggregated to video-level representations by computing statistics on these
features. Typically, zeroth-order (max) or first-order (average) statistics are
used. In this paper, we explore the benefits of using second-order statistics.
Specifically, we propose a novel end-to-end learnable feature aggregation
scheme, dubbed temporal correlation pooling that generates an action descriptor
for a video sequence by capturing the similarities between the temporal
evolution of clip-level CNN features computed across the video. Such a
descriptor, while being computationally cheap, also naturally encodes the
co-activations of multiple CNN features, thereby providing a richer
characterization of actions than their first-order counterparts. We also
propose higher-order extensions of this scheme by computing correlations after
embedding the CNN features in a reproducing kernel Hilbert space. We provide
experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained
datasets such as MPII Cooking activities and JHMDB, as well as the recent
Kinetics-600. Our results demonstrate the advantages of higher-order pooling
schemes, which, when combined with hand-crafted features (as is standard
practice), achieve state-of-the-art accuracy.
Comment: Accepted in the International Journal of Computer Vision (IJCV).
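The core idea above can be illustrated with a minimal sketch: instead of max- or average-pooling clip-level features over time, compute correlations between the temporal evolution of feature dimensions and use the (vectorized) correlation matrix as the video descriptor. This is an assumption-laden illustration, not the paper's exact formulation (the function name and the upper-triangle vectorization are choices made here):

```python
import numpy as np

def temporal_correlation_pool(clip_feats):
    """Second-order pooling sketch (hypothetical helper): correlations
    between the temporal evolution of CNN feature dimensions.
    clip_feats: (T, D) array of T clip-level, D-dimensional features."""
    X = clip_feats - clip_feats.mean(axis=0, keepdims=True)   # center over time
    X = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-8)  # unit-norm per dim
    C = X.T @ X                       # (D, D) matrix of feature co-activations
    iu = np.triu_indices(C.shape[0])  # keep upper triangle (C is symmetric)
    return C[iu]                      # vectorized second-order descriptor

rng = np.random.default_rng(0)
feats = rng.standard_normal((10, 8))       # 10 clips, 8-dim features
desc = temporal_correlation_pool(feats)
print(desc.shape)                          # D*(D+1)/2 entries, here (36,)
```

Compared with average pooling, which yields a D-dimensional vector, this descriptor also encodes which feature pairs rise and fall together over the video, at the cost of growing quadratically in D.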
Nearest Neighbors Using Compact Sparse Codes
In this paper, we propose a novel scheme for approximate nearest neighbor (ANN) retrieval based on dictionary learning and sparse coding. Our key innovation is to build compact codes, dubbed SpANN codes, using the active set of sparse-coded data. These codes are then used to index an inverted file table for fast retrieval. The active sets are often found to be sensitive to small differences among data points, resulting in only near-duplicate retrieval. We show that this sensitivity is related to the coherence of the dictionary: smaller coherence results in better retrieval. To this end, we propose a novel dictionary learning formulation with incoherence constraints and an efficient method to solve it. Experiments are conducted on two state-of-the-art computer vision datasets with 1M data points and show an order-of-magnitude improvement in retrieval accuracy without sacrificing memory or query time compared to state-of-the-art methods.
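A minimal sketch of the indexing idea described above: sparse-code each point, treat the active set (the indices of its nonzero coefficients) as a compact code, and use those indices as keys into an inverted file. Everything here is illustrative and assumed, not the paper's method: the dictionary is random rather than learned with incoherence constraints, and `active_set` is a greedy stand-in for a real OMP/Lasso solver:

```python
import numpy as np
from collections import defaultdict

def active_set(x, D, k=3):
    """Greedy stand-in for sparse coding: the k atoms most
    correlated with x serve as its active set / compact code."""
    scores = np.abs(D.T @ x)
    return tuple(sorted(np.argsort(scores)[-k:]))

rng = np.random.default_rng(1)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)      # unit-norm atoms; low coherence aids retrieval
data = rng.standard_normal((100, 16))

# Inverted file: atom index -> ids of points whose code uses that atom
inv = defaultdict(set)
codes = {}
for i, x in enumerate(data):
    codes[i] = active_set(x, D)
    for atom in codes[i]:
        inv[atom].add(i)

# Query: candidates are points sharing at least one active atom with the query
q = data[0] + 0.01 * rng.standard_normal(16)
candidates = set().union(*(inv[a] for a in active_set(q, D)))
print(len(candidates))
```

The sensitivity the abstract mentions shows up here directly: if the dictionary atoms are highly coherent, a small perturbation of `q` flips which atoms win, the active set changes, and the true neighbor is missed by the inverted-file lookup.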
Ordered Pooling of Optical Flow Sequences for Action Recognition
Training of Convolutional Neural Networks (CNNs) on long video sequences is
computationally expensive due to the substantial memory requirements and the
massive number of parameters that deep architectures demand. Early fusion of
video frames is thus a standard technique, in which several consecutive frames
are first agglomerated into a compact representation, and then fed into the CNN
as an input sample. For this purpose, a summarization approach was recently
proposed that represents a set of consecutive RGB frames by a single dynamic
image capturing pixel dynamics. In this paper, we introduce a novel ordered
representation of consecutive optical flow frames as an alternative and argue
that this representation captures the action dynamics more effectively than RGB
frames. We provide intuitions on why such a representation is better for action
recognition. We validate our claims on standard benchmark datasets and
demonstrate that using summaries of flow images leads to significant
improvements over RGB frames while achieving accuracy comparable to the
state-of-the-art on the UCF101 and HMDB datasets.
Comment: Accepted in WACV 201
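The ordered summarization described above can be sketched as follows, using the linear approximate-rank-pooling weights popularized for dynamic images (alpha_t = 2t - T - 1). This is a hedged illustration under that assumption; the paper's exact weighting and preprocessing of the flow fields may differ:

```python
import numpy as np

def dynamic_flow_image(flow_frames):
    """Ordered pooling sketch: collapse a sequence of optical-flow
    frames into one summary image that preserves temporal order.
    flow_frames: (T, H, W, 2) stack of flow fields (u, v channels).
    Assumes the linear dynamic-image weights alpha_t = 2t - T - 1."""
    T = flow_frames.shape[0]
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = 2.0 * t - T - 1.0          # early frames negative, late frames positive
    return np.tensordot(alpha, flow_frames, axes=(0, 0))   # (H, W, 2) summary

rng = np.random.default_rng(2)
flows = rng.standard_normal((6, 4, 4, 2))   # 6 flow frames of a 4x4 sequence
summary = dynamic_flow_image(flows)
print(summary.shape)                        # (4, 4, 2)
```

Because the weights sum to zero, a static (motionless) sequence collapses to an all-zero summary, so the resulting image emphasizes how motion evolves over the window rather than its average magnitude.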